When Back Propagation meets brains: The Optimizers Chronicles
Optimizers
“Optimizers help to get results faster”
1. What is an Optimizer?
In deep learning, an optimizer is an algorithm used to adjust the weights and biases of a neural network to minimize the loss function. The loss function tells us how far the model’s predictions are from the actual values.
More generally, optimizers are algorithms or methods that change the attributes of your neural network, such as its weights and learning rate, in order to reduce the loss.
Without optimization, training a neural network would be inefficient, slow, or even impossible in complex, high-dimensional parameter spaces.
Optimizers enable the training of massive models like GPT-3 and vision transformers. The optimizer plays a critical role in finding the best parameters (weights) that help the model perform well.
2. Why Do We Need Optimizers?
a) High-Dimensional and Complex Loss Surfaces
Neural networks have millions of parameters, making the loss function’s surface very high-dimensional.
This surface can have saddle points, steep valleys, hills, and flat regions (plateaus). A simple algorithm like basic gradient descent may struggle to navigate such a surface efficiently.
Optimizers help navigate this landscape to find a minimum of the loss function efficiently.
b) Faster Convergence
Standard gradient descent may converge very slowly due to oscillations or poor learning rates.
Some optimizers, like Adam and Momentum, can accelerate the convergence and help the model reach the minimum of the loss function more quickly than basic gradient descent.
c) Avoid Getting Stuck
Optimizers like Adam or Momentum can help the model avoid getting stuck in local minima or flat regions of the loss surface.
d) Efficient Training
Training large neural networks can take a long time. Optimizers are designed to reduce computation time while improving accuracy.
e) Vanishing and Exploding Gradients
Gradients may become very small (vanish) or very large (explode), especially in deep networks.
Optimizers like Adam or RMSProp handle these issues by using adaptive learning rates.
f) Overfitting
Optimizers combined with regularization techniques (e.g., weight decay folded into the update, or dropout in the network) can help prevent overfitting.
3. Example: Why Optimizers Matter
Without an Optimizer
Imagine training a deep learning model using vanilla gradient descent:
The model might move too slowly in regions where the gradient is small.
It may oscillate back and forth in steep regions, wasting time.
It could stop progressing if it gets stuck in a flat area (plateau) or saddle points.
With a Good Optimizer
Using an advanced optimizer like Adam or momentum:
The optimizer dynamically adjusts the learning rate for each parameter, speeding up training.
It combines techniques like momentum to smooth the updates and prevent oscillations.
It avoids flat regions and quickly finds a better solution.
4. Types of Optimizers and Their Role
4.1. Gradient Descent
Simplest and most basic optimizer.
Updates parameters using the gradient of the loss: \[
\theta_{t+1} = \theta_t - \eta \nabla J(\theta_t)
\]
\(\theta_t\): Current parameters.
\(\eta\): Learning rate.
\(\nabla J(\theta_t)\): Gradient of the loss function.
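As a minimal sketch of this update rule (not from the original post; the quadratic loss and learning rate below are illustrative choices), it can be written directly in plain Python:

```python
# Illustrative quadratic loss J(theta) = theta^2, whose gradient is 2*theta
def grad_J(theta):
    return 2 * theta

theta = 5.0  # initial parameter
eta = 0.1    # learning rate

# Repeatedly apply theta <- theta - eta * grad J(theta)
for _ in range(100):
    theta = theta - eta * grad_J(theta)

print(theta)  # approaches the minimizer theta = 0
```

For this particular loss, each update multiplies \(\theta\) by \(1 - 2\eta = 0.8\), so the parameter shrinks geometrically toward the minimum at 0.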
Challenges of Vanilla Gradient Descent
Slow convergence: especially in flat or high-curvature regions.
Local minima: May get stuck in poor solutions.
Sensitivity to learning rate: Choosing \(\eta\) is crucial and difficult.
4.2. Stochastic Gradient Descent (SGD)
Uses a random subset of data to compute gradients, making updates faster.
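A rough sketch of that idea (the dataset, batch size, and step count here are invented for illustration): each update estimates the gradient from a random mini-batch instead of the full dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data for y = 3x + noise (illustrative)
X = rng.uniform(-1, 1, 1000)
y = 3 * X + rng.normal(0, 0.1, 1000)

w = 0.0          # single weight to learn
eta = 0.1        # learning rate
batch_size = 32

for step in range(300):
    idx = rng.integers(0, len(X), batch_size)  # random subset of the data
    xb, yb = X[idx], y[idx]
    grad = np.mean(2 * (w * xb - yb) * xb)     # gradient of the MSE on the batch
    w -= eta * grad

print(round(w, 2))  # close to the true slope 3
```

Each mini-batch gradient is a noisy but cheap estimate of the full-data gradient, which is why SGD updates are faster per step.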
4.3. Advanced Optimizers
To address the limitations of basic gradient descent, advanced optimizers incorporate additional mechanisms:
4.3.1. Momentum
Accumulates past gradients to smooth updates and dampen oscillations.
Helps in high-curvature regions or noisy gradients.
4.3.2. RMSprop
Scales the learning rate by dividing by the root mean square of recent gradients.
Handles vanishing gradients by normalizing updates.
4.3.3. Adam (Adaptive Moment Estimation)
Combines the benefits of momentum and RMSprop.
Maintains running averages of both gradients and their squares.
Works well in most practical deep learning scenarios.
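Adam's update can be sketched on a toy quadratic loss; the hyperparameters below are the commonly used defaults from the Adam paper, while the loss and step count are illustrative:

```python
import numpy as np

def grad(theta):
    return 2 * theta  # gradient of the toy loss J(theta) = theta^2

theta = 5.0
eta = 0.1                  # step size
beta1, beta2 = 0.9, 0.999  # decay rates for the two moment estimates
eps = 1e-8
m, v = 0.0, 0.0            # first and second moment (EWMA) accumulators

for t in range(1, 501):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g        # momentum-style average of gradients
    v = beta2 * v + (1 - beta2) * g**2     # RMSprop-style average of squared gradients
    m_hat = m / (1 - beta1**t)             # bias correction (averages start at 0)
    v_hat = v / (1 - beta2**t)
    theta -= eta * m_hat / (np.sqrt(v_hat) + eps)

print(theta)  # converges toward the minimum at 0
```

The `m` accumulator plays the role of momentum, while `v` rescales the step per parameter as in RMSprop; the bias-correction terms compensate for both averages being initialized at zero.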
Exponentially Weighted Moving Average (EWMA)
1. What is EWMA?
The Exponentially Weighted Moving Average (EWMA) is a technique that calculates a weighted average of past observations, where the weights decrease exponentially for older data points. This gives more importance to recent data while retaining some influence from historical data.
It is based on the assumption that recent values of a variable contribute more to the formation of the next value than earlier values do.
The Exponentially Weighted Moving Average (EWMA) is commonly used as a smoothing technique in time series. However, due to several computational advantages (fast, low-memory cost), the EWMA is behind the scenes of many optimization algorithms in deep learning, including Gradient Descent with Momentum, RMSprop, Adam, etc.
Formula
The formula for EWMA is: \[
\large v_t = \beta \cdot v_{t-1} + (1 - \beta) \cdot x_t
\]
Components:
\(v_t\) : The current EWMA value.
\(v_{t-1}\): The previous EWMA value.
\(\beta\) : The smoothing factor or decay rate (\(0 \leq \beta < 1\)).
\(x_t\) : The current data point.
Explanation:
The current EWMA value, \(v_t\), is a combination of:
A fraction (\(\beta\)) of the previous EWMA value (\(v_{t-1}\)).
A fraction (\(1 - \beta\)) of the current data point (\(x_t\)).
```python
import pandas as pd
import datetime as dt
import matplotlib.pyplot as plt
import yfinance as yf
import seaborn as sns

strt = dt.datetime(2020, 1, 1)
end = dt.datetime(2021, 1, 2)
data = yf.download("SPY", strt, end)

plt.figure(figsize=(14, 6))
plt.scatter(data.index, data[('Open', 'SPY')])
plt.title("A time series data")
plt.show()
```
If we overlay the EWMA on this plot in red, we can see that what we get is a moving average of the daily price: a smooth, less noisy curve.
Let's examine the general equation a bit more: \[
\large v_t = \overbrace{\beta \cdot v_{t-1}}^{\textcolor{red}{\text{trend}}} + \underbrace{(1 - \beta) \cdot x_t}_{\textcolor{red}{\text{current value}}}
\]
We can see that the value of \(\beta\) determines how important the previous value is (the trend), and \((1-\beta)\) how important the current value is.
Take a value of \(\beta = 0.95\) and plot it; notice that the curve is smoother because the trend is now more important (and the current price value less important), so it adapts more slowly when the price changes.
Let's try the other extreme and set \(\beta = 0.5\). This way the graph we get is noisier, because it is more susceptible to the current values (and this includes outliers). It adapts more quickly to recent changes in price.
Code
```python
data = [5, 6, 4, 7, 5, 10, 8, 6, 7]
indexx = [i for i in range(len(data))]
beta_values = [0.1, 0.5, 0.9]

def ewma(data, beta):
    v = data[0]
    ewma_values = [v]
    for x in data[1:]:
        v = beta * v + (1 - beta) * x
        ewma_values.append(v)
    return ewma_values

# Plot
for beta in beta_values:
    ewma_values = ewma(data, beta)
    plt.plot(ewma_values, label=f"EWMA (β={beta})")

plt.scatter(indexx, data, label="Original Data", marker="o", s=50, color="red")
plt.title(r"Impact of $\beta$ on EWMA")
plt.xlabel("Time")
plt.ylabel("Value")
plt.legend()
plt.grid()
plt.show()
```
Low \(\beta\) (e.g., \(\beta = 0.1\))
Behavior:
Gives more weight to recent data (\(1 - \beta\) is large).
Reacts quickly to changes or noise in the data.
Results in a less smooth curve that closely follows the data points.
Interpretation:
We can think of this as “overfitting” to recent data because the moving average reflects every fluctuation, even noise.
High \(\beta\) (e.g., \(\beta = 0.9\))
Behavior:
Gives more weight to historical data (\(\beta\) is large).
Reacts slowly to new data points.
Results in a smoother curve that may lag behind significant trends.
Interpretation:
We can think of this as “underfitting” because the moving average ignores small changes and emphasizes long-term trends.
\[1 - \beta = \frac{2}{n+1}\] where \(n\) = number of past data points contributing significantly to the moving average. (With \(\beta\) weighting the previous value, a larger window must correspond to a larger \(\beta\), which is why the window size determines \(1-\beta\) rather than \(\beta\) itself.)
This relation provides a way to approximate \(\beta\) from the desired "smoothing window" size \(n\). For example, \(n = 19\) gives \(\beta = 0.9\).
For \(0 < \beta < 1\), as we keep expanding the equation, it becomes clear that each new value \(v_t\) is a weighted sum of all previous inputs, with exponentially decaying weights.
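Unrolling the recurrence (starting from \(v_0 = 0\)) makes those decaying weights explicit:

\[
v_t = (1-\beta)\,x_t + \beta(1-\beta)\,x_{t-1} + \beta^2(1-\beta)\,x_{t-2} + \cdots = (1-\beta)\sum_{k=0}^{t-1}\beta^{k}\,x_{t-k}
\]

Each input \(x_{t-k}\) is weighted by \(\beta^{k}(1-\beta)\), which decays exponentially in \(k\).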
Specifically, \(\beta^3 < \beta^2 < \beta\), implying that the influence of older values diminishes over time. Thus, recent values (or updates in the case of deep learning optimizers) contribute more to the final result.
This behavior makes the EWMA a useful tool in optimizers like Adam, where the moving average of gradients and squared gradients is used to adjust learning rates and stabilize optimization.
7. Applications
Time Series Analysis:
Smoothing noisy data.
Finance:
Tracking trends in stock prices or financial metrics.
Deep Learning:
Used in optimizers like Adam for gradient smoothing.
Process Control:
Monitoring and detecting changes in production quality.
8. Advantages
Simple to implement.
Adaptive to recent changes.
Smooths out noise effectively.
9. Disadvantages
Requires tuning of \(\beta\).
Sensitive to outliers if \(\beta\) is too small.
Contour plots
A contour plot is a graphical representation used to visualize a 3D surface on a 2D plane.
It uses contours (lines or curves) to represent regions of constant value, making it ideal for showing the relationship between two variables and a third dependent variable.
1. Why?
In the context of deep learning:
Contour plots are often used to visualize the loss landscape during model training. This helps illustrate how the loss function changes with respect to model parameters, such as weights and biases.
They are helpful in understanding concepts like gradient descent, showing how optimization algorithms minimize the loss by moving through the contours.
They provide insights into whether the loss landscape is smooth, convex, or contains local minima, which affects the optimization process.
The bottom part of the diagram shows some contour lines with a straight line running through the location of the maximum value; the curve at the top represents the values along that straight line.
```python
import numpy as np
import matplotlib.pyplot as plt

# Define a loss function
def loss_function(w1, w2):
    return w1**2 + w2**2

# Gradient of the loss function
def gradient(w1, w2):
    grad_w1 = 2 * w1
    grad_w2 = 2 * w2
    return np.array([grad_w1, grad_w2])

# Gradient Descent Parameters
learning_rate = 0.1
max_iter = 50
tolerance = 1e-5

# Starting point
w = np.array([2.5, 2.5])  # Initial weights
trajectory = [w.copy()]   # To store the path of gradient descent

# Perform gradient descent
for i in range(max_iter):
    grad = gradient(w[0], w[1])
    w -= learning_rate * grad
    trajectory.append(w.copy())
    # if np.linalg.norm(grad) < tolerance:
    #     break

trajectory = np.array(trajectory)

# Generate grid data for plotting
w1 = np.linspace(-3, 3, 100)
w2 = np.linspace(-3, 3, 100)
W1, W2 = np.meshgrid(w1, w2)
Z = loss_function(W1, W2)

# Create the plots
fig = plt.figure(figsize=(14, 6))

# 3D Surface Plot
ax1 = fig.add_subplot(1, 2, 1, projection='3d')
ax1.plot_surface(W1, W2, Z, alpha=0.5, edgecolor='k', cmap='viridis')
ax1.view_init(elev=60, azim=255, roll=0)
ax1.set_title("3D Surface Plot with Gradient Descent Path", size=20)
ax1.set_xlabel("Weight 1 (w1)", size=15)
ax1.set_ylabel("Weight 2 (w2)", size=15)
ax1.set_zlabel("Loss", rotation=90, size=15)

# Plot gradient descent path in 3D
trajectory_z = loss_function(trajectory[:, 0], trajectory[:, 1])
ax1.plot(trajectory[:, 0], trajectory[:, 1], trajectory_z,
         color='red', marker='*', label="GD Path", alpha=1)
ax1.legend()

# Contour Plot
ax2 = fig.add_subplot(1, 2, 2)
contour = ax2.contour(W1, W2, Z, levels=30, cmap='viridis')
ax2.clabel(contour, inline=True, fontsize=8)
ax2.set_title("Contour Plot with Gradient Descent Path", size=20)
ax2.set_xlabel("Weight 1 (w1)", size=15)
ax2.set_ylabel("Weight 2 (w2)", size=15)
plt.colorbar(contour, ax=ax2).set_label("Loss", size=15)

# Plot gradient descent path on contour plot
ax2.plot(trajectory[:, 0], trajectory[:, 1], color='red', marker='o', label="GD Path")
ax2.legend()

plt.tight_layout()
plt.show()
```
Momentum Gradient Descent
Momentum (magenta) vs. Gradient Descent (cyan) on a surface with a global minimum (the left well) and local minimum (the right well)
Gradient Descent (GD) is a foundational optimization algorithm in machine learning. However, Momentum Gradient Descent improves upon it by addressing some of its weaknesses, such as slow convergence and oscillations. Let’s understand why we need it, what it is, and how it works in detail.
1. Why Do We Need Momentum Gradient Descent?
Problems with Standard Gradient Descent
Oscillations in High Curvature Regions:
When the loss function has steep valleys (high curvature), standard gradient descent oscillates back and forth because gradients are large in steep directions.
This slows down convergence and wastes computation.
A high-curvature (pathological) surface, with red representing the highest values and blue the lowest.
Path by GD vs Ideal path
Slow Convergence in Flat Regions:
In regions with low gradient values (flat areas), gradient descent moves very slowly.
Real-Life Analogy 📦
Imagine you’re rolling a ball down a mountain:
Standard Gradient Descent: The ball moves only in the direction of the slope at that point. If the slope changes frequently, the ball will zigzag (oscillate) and move inefficiently.
Momentum: The ball builds up velocity (speed + direction) based on previous slopes. Even when the slope temporarily flattens, the ball keeps moving due to its accumulated momentum.
Momentum helps accelerate learning and smooth out oscillations.
2. What is Momentum Gradient Descent?
Momentum Gradient Descent modifies standard gradient descent by introducing a velocity term. This term considers the past gradients (historical information) to decide the direction and speed of the parameter updates.
It is an optimization algorithm that builds upon standard gradient descent to accelerate convergence, especially in scenarios involving noisy gradients or valleys in the loss surface.
In simpler terms: - Standard GD moves using just the current gradient. - Momentum GD combines past gradients (momentum) with the current gradient.
3. How Does Momentum Gradient Descent Work?
Momentum Gradient Descent smooths parameter updates by introducing a velocity term that accumulates past gradients via an exponentially weighted moving average (EWMA):
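In symbols, using the common variant that folds the learning rate \(\eta\) into the velocity update:

\[
v_t = \beta\, v_{t-1} + \eta\, \nabla J(\theta_t), \qquad \theta_{t+1} = \theta_t - v_t
\]

The code in this section updates `v` and `theta` with exactly this recurrence.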
```python
def gradient(theta):
    return 2 * theta  # Gradient of the loss function J(theta) = theta^2

def loss_function(theta):
    return theta ** 2  # Loss function J(theta) = theta^2

def momentum_gd_ewma(theta_init, learning_rate, beta, num_steps):
    theta = theta_init
    v = 0  # Initialize velocity (EWMA starts at 0)
    history = []
    for step in range(num_steps):
        grad = gradient(theta)
        v = beta * v + learning_rate * grad  # EWMA formula for momentum
        theta -= v
        loss = loss_function(theta)  # Calculate loss
        history.append(theta)
        print(f"Step {step+1}: Parameter = {theta:.4f}, Loss = {loss:.4f}")
        if loss < 1e-2:  # Stop if loss is very small
            break
    return history

# Parameters
theta_initial = 10
learning_rate = 0.1
beta_1 = 0.5
steps = 50

# Run the Momentum Gradient Descent
momentum_gd_ewma(theta_initial, learning_rate, beta_1, steps)
```
Step 1: Parameter = 8.0000, Loss = 64.0000
Step 2: Parameter = 5.4000, Loss = 29.1600
Step 3: Parameter = 3.0200, Loss = 9.1204
Step 4: Parameter = 1.2260, Loss = 1.5031
Step 5: Parameter = 0.0838, Loss = 0.0070
Batch GD
```python
theta_initial = 10
learning_rate = 0.1
beta_1 = 0  # making GD like Batch GD
steps = 50

# Run the Momentum Gradient Descent
momentum_gd_ewma(theta_initial, learning_rate, beta_1, steps)
```
Step 1: Parameter = 8.0000, Loss = 64.0000
Step 2: Parameter = 6.4000, Loss = 40.9600
Step 3: Parameter = 5.1200, Loss = 26.2144
Step 4: Parameter = 4.0960, Loss = 16.7772
Step 5: Parameter = 3.2768, Loss = 10.7374
Step 6: Parameter = 2.6214, Loss = 6.8719
Step 7: Parameter = 2.0972, Loss = 4.3980
Step 8: Parameter = 1.6777, Loss = 2.8147
Step 9: Parameter = 1.3422, Loss = 1.8014
Step 10: Parameter = 1.0737, Loss = 1.1529
Step 11: Parameter = 0.8590, Loss = 0.7379
Step 12: Parameter = 0.6872, Loss = 0.4722
Step 13: Parameter = 0.5498, Loss = 0.3022
Step 14: Parameter = 0.4398, Loss = 0.1934
Step 15: Parameter = 0.3518, Loss = 0.1238
Step 16: Parameter = 0.2815, Loss = 0.0792
Step 17: Parameter = 0.2252, Loss = 0.0507
Step 18: Parameter = 0.1801, Loss = 0.0325
Step 19: Parameter = 0.1441, Loss = 0.0208
Step 20: Parameter = 0.1153, Loss = 0.0133
Step 21: Parameter = 0.0922, Loss = 0.0085
Here,
With momentum (\(\beta=0.5\)), the algorithm “remembers” the previous gradients and accelerates the descent, converging in 5 steps.
Without momentum (\(\beta=0\)), each step is based solely on the current gradient, so it converges more slowly, taking 21 steps.
The optimizer is able to “skip over” areas where the gradient changes direction frequently, enabling faster progress.
The addition of momentum significantly speeds up convergence in gradient descent.
When momentum is used, past gradients help the algorithm move faster, especially in the directions where the gradient is consistently large.
Convergence Time Taken
Code
```python
import numpy as np
import matplotlib.pyplot as plt

# Loss function J(theta) = 2*theta^2*sin(theta)*cos(theta) = theta^2*sin(2*theta)
def loss_function(theta):
    return 2 * (theta**2) * np.sin(theta) * np.cos(theta)

# Gradient of the loss function (derivative of theta^2*sin(2*theta))
def gradient(theta):
    return 2 * theta * np.sin(2 * theta) + 2 * theta**2 * np.cos(2 * theta)

# Gradient Descent without Momentum
def gd_no_momentum(theta_init, learning_rate, num_steps):
    theta = theta_init
    history = []
    for step in range(num_steps):
        grad = gradient(theta)
        theta -= learning_rate * grad
        history.append(theta)
    return np.array(history)

# Gradient Descent with Momentum (EWMA)
def momentum_gd_ewma(theta_init, learning_rate, beta, num_steps):
    theta = theta_init
    v = 0  # Initialize velocity (EWMA starts at 0)
    history = []
    for step in range(num_steps):
        grad = gradient(theta)
        v = beta * v + learning_rate * grad  # EWMA formula for momentum
        theta -= v
        history.append(theta)
    return np.array(history)

# Parameters
theta_initial = 0.9
learning_rate = 0.2
beta = 0.9  # Momentum parameter
steps1 = 500
steps2 = 20

# Run Gradient Descent without Momentum
history_no_momentum = gd_no_momentum(theta_initial, learning_rate, steps1)

# Run Gradient Descent with Momentum
history_momentum = momentum_gd_ewma(theta_initial, learning_rate, beta, steps2)

# Plotting the convergence paths
theta_vals = np.linspace(-1.4, 1.4, 400)
loss_vals = loss_function(theta_vals)

plt.figure(figsize=(14, 6))

# Plot Loss Curve
plt.plot(theta_vals, loss_vals,
         label=r'Loss Curve $J(\theta) = 2\theta^2 \sin\theta \cos\theta$',
         color='grey', linestyle='--')

# Plot path of Gradient Descent without Momentum
plt.plot(history_no_momentum, loss_function(history_no_momentum), 'ro-',
         label=f'Without Momentum, steps taken = {steps1}', markersize=6)

# Plot path of Gradient Descent with Momentum
plt.plot(history_momentum, loss_function(history_momentum), 'bo-',
         label=f'With Momentum, steps taken = {steps2}', markersize=6)

plt.xlabel(r'$\Theta$', size=15)
plt.ylabel(r'Loss $J(\theta)$', size=15)
plt.suptitle('Convergence of Gradient Descent', size=18)
plt.title("With and Without Momentum", size=15)
plt.legend(fontsize=15)
plt.grid(True)
plt.show()
```
Effect of Momentum:
When \(\beta=0\) :
No momentum is applied, so each step is solely based on the current gradient and works like Vanilla Gradient Descent.
The updates are more erratic, and the convergence is slower because the optimizer doesn’t “remember” previous gradients.
This results in more steps to converge to the minimum.
When \(0<\beta<1\) :
The previous gradients have more influence on the update, causing the optimizer to build up momentum.
This results in faster convergence because the parameter updates accumulate in a consistent direction, reducing oscillations and speeding up the overall optimization process.
The larger the \(\beta\), the more momentum is carried over from previous steps.
Code
```python
import numpy as np
import plotly.graph_objects as go
from plotly.offline import init_notebook_mode
import cufflinks as cf

init_notebook_mode(connected=True)
cf.go_offline()

# Loss function J(theta) = 2*theta^2*sin(theta)*cos(theta) = theta^2*sin(2*theta)
def loss_function(theta):
    return 2 * (theta**2) * np.sin(theta) * np.cos(theta)

# Gradient of the loss function (derivative of theta^2*sin(2*theta))
def gradient(theta):
    return 2 * theta * np.sin(2 * theta) + 2 * theta**2 * np.cos(2 * theta)

# Gradient Descent with Momentum (EWMA)
def momentum_gd_ewma2(theta_init, learning_rate, beta, num_steps):
    theta = theta_init
    v = 0  # Initialize velocity (EWMA starts at 0)
    history = []
    for step in range(num_steps):
        grad = gradient(theta)
        v = beta * v + learning_rate * grad  # EWMA formula for momentum
        theta -= v
        history.append(theta)
    return np.array(history)

# Curve for plotting
theta_vals = np.linspace(-1.4, 1.4, 400)
loss_vals = loss_function(theta_vals)

# Parameters
theta_initial = 0.9
learning_rate = 0.2
beta = [0.9, 0.7, 0.5, 0.2]      # Momentum parameters
num_steps = [20, 50, 500, 500]
colors = ['red', 'blue', 'green', 'purple']

fig = go.Figure()
fig.add_trace(go.Scatter(x=theta_vals, y=loss_vals, mode='lines', name='Loss Curve',
                         line=dict(color='black', dash='dash')))

for i in range(len(beta)):
    history = momentum_gd_ewma2(theta_initial, learning_rate, beta[i], num_steps[i])
    loss_history = loss_function(history)
    fig.add_trace(go.Scatter(x=history, y=loss_history, mode='lines+markers',
                             name=f'β={beta[i]}, steps = {num_steps[i]}',
                             marker=dict(color=colors[i], size=8),
                             line=dict(color=colors[i], width=2),
                             hovertemplate="θ: %{x:.2f}<br>Loss: %{y:.2f}"))

# Update layout
fig.update_layout(
    title=dict(text="Convergence of Gradient Descent with Momentum", x=0.5, font=dict(size=18)),
    xaxis_title="θ",
    yaxis_title="Loss J(θ)",
    legend_title="Legend",
    legend=dict(yanchor="top", y=0.99, xanchor="left", x=0.01),
    font=dict(size=14),
    template="ggplot2",
    width=700, height=400,
    margin=dict(l=10, r=10, t=40, b=20),
)
fig.show()
```
Code
```python
import numpy as np
import plotly.graph_objects as go
from plotly.offline import init_notebook_mode

# Initialize notebook mode for Plotly (if running in a Jupyter environment)
init_notebook_mode(connected=True)

# Generate a simple dataset for a linear regression task with 2D weights
np.random.seed(42)
X_data = np.linspace(-2, 2, 200).reshape(-1, 1)
X_data = np.hstack((X_data, np.ones_like(X_data)))  # Adding a bias column
y_data = 3 * X_data[:, 0] + 2 + np.random.normal(0, 0.5, X_data.shape[0])  # Linear relationship with noise

# Define the loss function (Mean Squared Error) with a sinusoidal term (non-convex)
def loss_function(W, X, y):
    preds = W[0] * X[:, 0] + W[1] * X[:, 1]
    return np.mean((preds - y)**2) + 5 * np.sin(W[0]) * np.cos(W[1])

# Define the gradient of the loss function
def gradient(W, x, y):
    preds = W[0] * x[:, 0] + W[1] * x[:, 1]
    error = preds - y
    grad_w0 = np.mean(2 * error * x[:, 0]) + 5 * np.cos(W[0]) * np.cos(W[1])
    grad_w1 = np.mean(2 * error * x[:, 1]) - 5 * np.sin(W[0]) * np.sin(W[1])
    return np.array([grad_w0, grad_w1])

# Initialize parameters for SGD and Momentum-SGD
learning_rate = 0.09
beta = 0.45  # Momentum coefficient
epochs = 1
W_sgd = np.array([6.0, -3.0])  # Initial weights for SGD
W_momentum = W_sgd.copy()      # Initial weights for Momentum SGD

# Initialize momentum term
velocity = np.array([0.0, 0.0])  # Initial momentum (v_0)

# Lists to store weight and loss values for plotting
W_vals_sgd, loss_vals_sgd = [W_sgd.copy()], [loss_function(W_sgd, X_data, y_data)]
W_vals_momentum, loss_vals_momentum = [W_momentum.copy()], [loss_function(W_momentum, X_data, y_data)]

# Perform SGD and Momentum-based SGD
for epoch in range(epochs):
    indices = np.random.permutation(len(X_data))
    X_data_shuffled = X_data[indices]
    y_data_shuffled = y_data[indices]
    for i in range(len(X_data_shuffled)):
        x_i = X_data_shuffled[i].reshape(1, -1)
        y_i = y_data_shuffled[i]

        # Standard SGD update
        grad_sgd = gradient(W_sgd, x_i, np.array([y_i]))
        W_sgd -= learning_rate * grad_sgd
        W_vals_sgd.append(W_sgd.copy())
        loss_vals_sgd.append(loss_function(W_sgd, X_data, y_data))

        # Momentum-based SGD update
        grad_momentum = gradient(W_momentum, x_i, np.array([y_i]))
        velocity = beta * velocity + learning_rate * grad_momentum
        W_momentum -= velocity
        W_vals_momentum.append(W_momentum.copy())
        loss_vals_momentum.append(loss_function(W_momentum, X_data, y_data))

# Convert weight values into 3D coordinates for plotting
W_vals_sgd = np.array(W_vals_sgd)
loss_vals_sgd = np.array(loss_vals_sgd)
x_path_sgd, y_path_sgd = W_vals_sgd[:, 0], W_vals_sgd[:, 1]
z_path_sgd = loss_vals_sgd

W_vals_momentum = np.array(W_vals_momentum)
loss_vals_momentum = np.array(loss_vals_momentum)
x_path_momentum, y_path_momentum = W_vals_momentum[:, 0], W_vals_momentum[:, 1]
z_path_momentum = loss_vals_momentum

# Create a meshgrid for the 3D surface of the loss function
x_range = np.linspace(-4, 8, 100)
y_range = np.linspace(-4, 8, 100)
X, Y = np.meshgrid(x_range, y_range)
Z = np.array([loss_function([x, y], X_data, y_data) for x, y in zip(np.ravel(X), np.ravel(Y))])
Z = Z.reshape(X.shape)

# Plotting with Plotly
fig = go.Figure()

# Add the 3D surface for the non-convex loss function
fig.add_trace(go.Surface(z=Z, x=X, y=Y, colorscale="algae", opacity=0.8))
fig.update_traces(contours_z=dict(show=True, usecolormap=True,
                                  highlightcolor="black", project_z=True))

# Add the SGD path as a scatter plot
fig.add_trace(go.Scatter3d(
    x=x_path_sgd, y=y_path_sgd, z=z_path_sgd,
    mode='markers+lines',
    marker=dict(size=5, color='red'),
    line=dict(color='red', width=3),
    name="Standard SGD Path"))

# Add the Momentum-SGD path as a scatter plot
fig.add_trace(go.Scatter3d(
    x=x_path_momentum, y=y_path_momentum, z=z_path_momentum,
    mode='markers+lines',
    marker=dict(size=5, color='indigo'),
    line=dict(color='indigo', width=3),
    name="Momentum SGD Path"))

# Labels and layout adjustments
fig.update_layout(
    title="Comparison of SGD and Momentum-based SGD Paths",
    scene=dict(
        xaxis_title='Weight 1 (W[0])',
        yaxis_title='Weight 2 (W[1])',
        zaxis_title='Loss (MSE)',
    ),
    width=900, height=600,
    margin=dict(l=10, r=10, t=40, b=20),
    legend=dict(orientation="h", yanchor="top", y=0.99, xanchor="left", x=0.01),
)
fig.show()
```
Key Insights
Momentum is not always guaranteed to speed up convergence. The effectiveness of \(\beta\) depends on:
Loss function’s geometry: Wide, shallow basins benefit from high momentum; narrow, curved valleys may not.
Learning rate \(\eta\) : High learning rates with large \(\beta\) can cause instability.
Initialization: Starting far from the global minimum can amplify oscillations with high momentum.
```python
theta_initial = 10
learning_rate = 0.1
beta_1 = 0.9  # high momentum: watch the parameter overshoot and oscillate
steps = 100

# Run the Momentum Gradient Descent
momentum_gd_ewma(theta_initial, learning_rate, beta_1, steps)
```
Step 1: Parameter = 8.0000, Loss = 64.0000
Step 2: Parameter = 4.6000, Loss = 21.1600
Step 3: Parameter = 0.6200, Loss = 0.3844
Step 4: Parameter = -3.0860, Loss = 9.5234
Step 5: Parameter = -5.8042, Loss = 33.6887
Step 6: Parameter = -7.0897, Loss = 50.2644
Step 7: Parameter = -6.8288, Loss = 46.6322
Step 8: Parameter = -5.2282, Loss = 27.3336
Step 9: Parameter = -2.7420, Loss = 7.5184
Step 10: Parameter = 0.0440, Loss = 0.0019
1. Accelerates Convergence
Momentum accumulates gradients along consistent directions, so updates pick up speed over time; e.g., a ball rolling down a shallow slope gains speed over time.
2. Reduces Oscillations in High-Curvature Regions
In regions with high curvature (sharp valleys), standard GD oscillates back and forth, slowing down convergence. Momentum smooths the updates, dampening oscillations and allowing steady progress.
3. Escapes Local Minima Faster
Local minima or saddle points can trap standard GD due to near-zero gradients. Accumulated momentum provides enough “velocity” to push out of shallow local minima or saddle points.
Gradient descent with and without momentum
4. Improves Training Stability
Standard GD is sensitive to the learning rate \(\eta\). A large learning rate causes instability; a small one slows convergence. Momentum smooths updates, reducing instability even if \(\eta\) is slightly larger. Because momentum combines gradients from multiple previous steps, the updates are less sensitive to noisy gradients in any single step.
5. Balances Speed and Precision
Momentum achieves a balance between:
Speed: Accumulating gradients accelerates movement in consistent directions.
Precision: EWMA smoothing dampens overshooting and noisy updates near the minimum.
6. Handles Non-Convex Loss Surfaces
Neural networks have complex, non-convex loss surfaces with:
Plateaus,
Saddle points,
Local minima.
How Momentum Helps:
Escapes saddle points,
Moves faster through plateaus,
Reduces oscillations in sharp valleys.
Problems with Momentum Gradient Descent ❌
1. Overshooting in Regions of Low Gradient
If the momentum builds up too much, it can cause the parameter updates to “overshoot” the minimum, especially in regions where the gradient is small.
Think of a ball rolling down a hill with momentum. If it rolls too fast, it may overshoot the bottom of the valley and climb the opposite slope.
2. Hyperparameter Sensitivity
The performance of momentum heavily depends on the choice of:
Momentum coefficient \(\beta\) (e.g., 0.9),
Learning rate \(\eta\).
Poorly chosen values can cause instability (if \(\beta\) is too large) or make momentum ineffective (if \(\beta\) is too small); both the learning rate \(\eta\) and the momentum coefficient \(\beta\) typically require manual tuning.
3. Directional Bias
If the optimization landscape changes significantly, momentum may “persist” in the old direction before adjusting to the new gradients. This causes inefficiency, as the updates take time to “catch up” to the correct direction.
Think of a car overshooting a turn because of its speed.
4. Slow Convergence at Saddle Points
In high-dimensional optimization (e.g., neural networks), saddle points are more common than local minima. Momentum can struggle to push the parameters out of these regions efficiently. Momentum does not inherently help escape saddle points because it relies on gradient accumulation. If gradients remain small, the updates also become small.
5. Non-Adaptive Updates
Momentum applies the same learning rate and momentum coefficient to all parameters, regardless of their individual gradients.
In neural networks, different parameters might require different learning rates (e.g., deeper layers). For example, weights in deeper layers of a neural network may require smaller updates compared to shallower layers.
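Adaptive methods such as RMSprop address this by maintaining a separate gradient scale for every parameter. A minimal sketch of the idea (the toy loss and hyperparameters below are invented for illustration):

```python
import numpy as np

# Toy loss with very different curvature per parameter:
# J(w) = 100*w0^2 + w1^2, so w0's gradients are ~100x larger than w1's
def grad(w):
    return np.array([200 * w[0], 2 * w[1]])

w = np.array([1.0, 1.0])
eta = 0.01
beta = 0.9       # decay rate of the squared-gradient EWMA
eps = 1e-8
s = np.zeros(2)  # per-parameter EWMA of squared gradients

for _ in range(1000):
    g = grad(w)
    s = beta * s + (1 - beta) * g**2    # track each parameter's gradient scale
    w -= eta * g / (np.sqrt(s) + eps)   # each parameter gets its own step size

print(w)  # both parameters approach 0 despite the curvature gap
```

Dividing by \(\sqrt{s}\) normalizes the step per parameter, so the steep direction and the shallow direction progress at comparable rates, which a single shared learning rate cannot achieve here.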